from google.colab import drive
drive.mount("/content/gdrive/")
%cd /content/gdrive/MyDrive/data_science/mile_stone2/
!ls
Mounted at /content/gdrive/
/content/gdrive/MyDrive/data_science/mile_stone2
'Copy of milestone2.ipynb'   milestone2.html   milestone2.ipynb
We perform an analysis of the historical price and earnings data of over 50 tech companies, visualizing trends and patterns. The goal is to identify the features that matter most for building trading recommendation strategies or predicting stock prices.
I found three datasets related to this project. One dataset is stock history, covering stock prices and volumes; the other two cover earnings reports: one is from Macrotrends(https://www.macrotrends.net/) and the other is from Yahoo Finance(https://finance.yahoo.com/).
Using a library called yfinance(https://pypi.org/project/yfinance/), I obtained the historical data of these companies as TABLE1, including their stock prices and volumes over the years. TABLE1 is the main dataset we focus on: it describes how the stock price moves.
We’re also looking at earnings data from Macrotrends (TABLE2) and Yahoo Finance (TABLE3) to get an idea of how these companies are doing financially, and whether they are growing. From Macrotrends, I have the quarterly revenue history, and we will consider the question: is Revenue the most important feature for determining how the stock moves? What is the relationship between this feature and the others?
From Yahoo Finance, I have the earnings calendar, which includes EPS Estimate, Reported EPS, and Surprise(%), and we will consider the question: is Reported EPS the most important feature for determining how the stock moves? Compared with the other features, is it a better determinant of stock movement?
I’m adding trend indicators such as MACD, RSI, and Bollinger Bands, computed with the pandas_ta(https://github.com/twopirllc/pandas-ta) library, to help us spot patterns and trends in the data more easily.
After tidying TABLE1 we can compute these indicators, and we will consider these questions:
Which of MACD, RSI, and Bollinger Bands is the most important feature for deciding how to operate trading strategies? What is the relationship between each feature and the others?
Currently I am working on this project alone. Since there is no one to collaborate with, I plan to use Google Drive and Google Colab as my development platform.
Milestone1: 1. collect the three datasets used in later stages; 2. tidy each table; 3. perform an EDA on each of the three tables.
Milestone2: 1. construct learning models, for example Contextual Bandits or Deep Reinforcement Learning, to produce trading recommendation strategies; 2. perform further EDA for useful feature extraction.
Final: 1. test different features; 2. compare the final results and draw conclusions.
We use yfinance(https://pypi.org/project/yfinance/) to download the daily stock price data and pandas_ta(https://github.com/twopirllc/pandas-ta) to compute the indicators used in later analysis.
!pip install pandas_ta
!pip install python-dateutil
import yfinance as yf
import pandas as pd
import matplotlib.pyplot as plt
import requests
from bs4 import BeautifulSoup
import pandas_ta as ta
from dateutil import parser
import time
import seaborn as sns
import warnings
import numpy as np
warnings.filterwarnings('ignore')
Collecting pandas_ta
Downloading pandas_ta-0.3.14b.tar.gz (115 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 115.1/115.1 kB 2.0 MB/s eta 0:00:00
Preparing metadata (setup.py) ... done
Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from pandas_ta) (1.5.3)
Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.10/dist-packages (from pandas->pandas_ta) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->pandas_ta) (2023.3.post1)
Requirement already satisfied: numpy>=1.21.0 in /usr/local/lib/python3.10/dist-packages (from pandas->pandas_ta) (1.23.5)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.1->pandas->pandas_ta) (1.16.0)
Building wheels for collected packages: pandas_ta
Building wheel for pandas_ta (setup.py) ... done
Created wheel for pandas_ta: filename=pandas_ta-0.3.14b0-py3-none-any.whl size=218907 sha256=bc5537b435ac1f4ced1898faf4c058abffd7480744234cac2ae62e72eabac7ed
Stored in directory: /root/.cache/pip/wheels/69/00/ac/f7fa862c34b0e2ef320175100c233377b4c558944f12474cf0
Successfully built pandas_ta
Installing collected packages: pandas_ta
Successfully installed pandas_ta-0.3.14b0
Requirement already satisfied: python-dateutil in /usr/local/lib/python3.10/dist-packages (2.8.2)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil) (1.16.0)
Columns Open, High, Low, Close, and Adj Close describe the stock price. Volume is the number of shares traded during a given day.
# Load data from Yahoo Finance
ticker = "AAPL"
stock_data = yf.download(ticker, start="2020-01-01", end="2023-01-01")
# Display the data
display(stock_data.head())
print(stock_data.shape)
display(stock_data.dtypes)
[*********************100%%**********************] 1 of 1 completed
| Open | High | Low | Close | Adj Close | Volume | |
|---|---|---|---|---|---|---|
| Date | ||||||
| 2020-01-02 | 74.059998 | 75.150002 | 73.797501 | 75.087502 | 73.249016 | 135480400 |
| 2020-01-03 | 74.287498 | 75.144997 | 74.125000 | 74.357498 | 72.536888 | 146322800 |
| 2020-01-06 | 73.447502 | 74.989998 | 73.187500 | 74.949997 | 73.114883 | 118387200 |
| 2020-01-07 | 74.959999 | 75.224998 | 74.370003 | 74.597504 | 72.771027 | 108872000 |
| 2020-01-08 | 74.290001 | 76.110001 | 74.290001 | 75.797501 | 73.941628 | 132079200 |
(756, 6)
Open         float64
High         float64
Low          float64
Close        float64
Adj Close    float64
Volume         int64
dtype: object
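As a side note on these columns: Adj Close folds dividends and stock splits back into the price, so daily returns are usually computed from it rather than from Close. A minimal sketch on synthetic numbers (hypothetical values, not the downloaded AAPL frame):

```python
import pandas as pd

# Synthetic frame standing in for the yfinance output (hypothetical numbers)
df = pd.DataFrame(
    {"Close": [100.0, 102.0, 101.0], "Adj Close": [99.0, 100.98, 99.99]},
    index=pd.date_range("2020-01-02", periods=3, freq="B"),
)

# Daily simple returns from the dividend/split-adjusted series
df["Return"] = df["Adj Close"].pct_change()
print(df["Return"].round(4).tolist())  # [nan, 0.02, -0.0098]
```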
The dataset above is an example for a single stock; now we extend it to 50 stocks.
We have chosen 50 tech company tickers for our analysis.
tickers = [
"AAPL", "MSFT", "AMZN", "GOOGL", "META", "TSLA", "NVDA", "PYPL", "INTC", "CSCO",
"ADBE", "CRM", "ACN", "TXN", "ORCL", "QCOM", "IBM", "AVGO", "AMD", "SAP",
"SHOP", "NOW", "SQ", "INTU", "FIS", "VMW", "HPE", "HPQ", "MU", "AMAT",
"LRCX", "KLAC", "ADI", "MCHP", "CDNS", "ANSS", "KEYS", "TEL", "WDAY", "CTSH",
"DOCU", "SNPS", "FTNT", "GLW", "VRSN", "AKAM", "SWKS", "TTD", "EPAM", "NET"
]
MACD (Moving Average Convergence Divergence): MACD is a trend-following momentum indicator that shows the relationship between two moving averages of a stock’s price.
MACD Signal: The MACD signal line is a 9-day EMA of the MACD. When the MACD crosses above the signal line, it gives a bullish (buy) signal, and when it crosses below, it gives a bearish (sell) signal.
MACD Histogram: The MACD histogram is the difference between the MACD and its signal line. A positive histogram suggests that bulls are in control (buying pressure), while a negative histogram suggests bears have control (selling pressure).
RSI (Relative Strength Index): RSI measures the magnitude of recent price changes to evaluate overbought or oversold conditions in the price of a stock. Readings above 70 indicate that a stock may be overbought, and readings below 30 indicate that the stock may be oversold.
Bollinger Upper Band: The upper Bollinger Band is two standard deviations above the simple moving average (SMA) of a stock's price. It helps identify when a stock’s price is considered "high."
Bollinger Middle Band: The middle Bollinger Band is the SMA of the stock’s price, often serving as support or resistance levels for the stock price.
Bollinger Lower Band: The lower Bollinger Band is two standard deviations below the SMA of the stock’s price. It helps identify when a stock’s price is considered "low."
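As a sanity check, the definitions above can be computed directly in pandas on a synthetic price series. This is a sketch using the common window conventions (12/26/9 for MACD, 14 for RSI, 20 for Bollinger Bands); pandas_ta's internals may differ slightly, e.g. it uses Wilder's smoothing for RSI rather than the simple rolling mean shown here:

```python
import numpy as np
import pandas as pd

# Synthetic close-price series (trend plus oscillation) for illustration
close = pd.Series(np.linspace(100, 120, 60) + np.sin(np.arange(60)))

# MACD: 12-day EMA minus 26-day EMA; the signal line is a 9-day EMA of MACD
ema12 = close.ewm(span=12, adjust=False).mean()
ema26 = close.ewm(span=26, adjust=False).mean()
macd = ema12 - ema26
signal = macd.ewm(span=9, adjust=False).mean()
histogram = macd - signal

# RSI, 14-day simple-average variant: ratio of average gains to average losses
delta = close.diff()
gain = delta.clip(lower=0).rolling(14).mean()
loss = (-delta.clip(upper=0)).rolling(14).mean()
rsi = 100 - 100 / (1 + gain / loss)

# Bollinger Bands: 20-day SMA plus/minus two standard deviations
mid = close.rolling(20).mean()
std = close.rolling(20).std()
upper, lower = mid + 2 * std, mid - 2 * std
```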
def fetch_and_enhance_stock_data(tickers):
    all_data = []
    for ticker in tickers:
        try:
            # Fetch historical stock data
            stock_data = yf.download(ticker, start="2010-01-01", end="2023-01-01")
            # Calculate MACD; pandas_ta returns the columns in the order
            # MACD_12_26_9, MACDh_12_26_9 (histogram), MACDs_12_26_9 (signal),
            # so select them by name to avoid mixing up signal and histogram
            macd = ta.macd(stock_data['Close'])
            stock_data['MACD'] = macd['MACD_12_26_9']
            stock_data['MACD Signal'] = macd['MACDs_12_26_9']
            stock_data['MACD Histogram'] = macd['MACDh_12_26_9']
            # Calculate RSI
            stock_data['RSI'] = ta.rsi(stock_data['Close'])
            # Calculate Bollinger Bands (columns are lower, middle, upper)
            bbands = ta.bbands(stock_data['Close'])
            stock_data['Bollinger Lower'] = bbands.iloc[:, 0]
            stock_data['Bollinger Middle'] = bbands.iloc[:, 1]
            stock_data['Bollinger Upper'] = bbands.iloc[:, 2]
            stock_data['Stock'] = ticker
            # Move the Stock column to the front
            cols = ['Stock'] + [col for col in stock_data if col != 'Stock']
            stock_data = stock_data[cols]
            # Drop NaN rows produced by the indicator warm-up periods
            all_data.append(stock_data.dropna())
        except Exception as e:
            print(f"Could not retrieve data for {ticker}: {str(e)}")
            continue
    # Concatenate all the individual DataFrames into one
    combined_data = pd.concat(all_data, axis=0)
    return combined_data
# Fetch and enhance the stock data
stock_data = fetch_and_enhance_stock_data(tickers)
table1 = stock_data.copy()
display(stock_data.head(10))
print(stock_data.shape)
display(stock_data.dtypes)
[*********************100%%**********************] 1 of 1 completed
| Stock | Open | High | Low | Close | Adj Close | Volume | MACD | MACD Signal | MACD Histogram | RSI | Bollinger Lower | Bollinger Middle | Bollinger Upper | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Date | ||||||||||||||
| 2010-02-22 | AAPL | 7.226429 | 7.232143 | 7.113929 | 7.157857 | 6.075565 | 390563600 | -0.088343 | 0.066146 | -0.154489 | 47.617442 | 7.145997 | 7.221214 | 7.296432 |
| 2010-02-23 | AAPL | 7.142857 | 7.190357 | 6.989643 | 7.037857 | 5.973710 | 575094800 | -0.091067 | 0.050738 | -0.141805 | 43.102673 | 7.024712 | 7.175929 | 7.327145 |
| 2010-02-24 | AAPL | 7.079643 | 7.194286 | 7.065714 | 7.166429 | 6.082839 | 460566400 | -0.081906 | 0.047918 | -0.129825 | 48.713452 | 7.022721 | 7.162429 | 7.302136 |
| 2010-02-25 | AAPL | 7.049286 | 7.245000 | 7.031786 | 7.214286 | 6.123461 | 665126000 | -0.069979 | 0.047877 | -0.117856 | 50.663657 | 7.030473 | 7.155786 | 7.281099 |
| 2010-02-26 | AAPL | 7.227857 | 7.327500 | 7.214286 | 7.307857 | 6.202885 | 507460800 | -0.052372 | 0.052387 | -0.104759 | 54.321085 | 7.001694 | 7.176857 | 7.352020 |
| 2010-03-01 | AAPL | 7.348214 | 7.482143 | 7.337500 | 7.463929 | 6.335357 | 550093600 | -0.025530 | 0.063383 | -0.088913 | 59.688942 | 6.952883 | 7.238072 | 7.523261 |
| 2010-03-02 | AAPL | 7.497500 | 7.529643 | 7.419286 | 7.458929 | 6.331113 | 566546400 | -0.004608 | 0.067444 | -0.072052 | 59.447917 | 7.077504 | 7.322286 | 7.567068 |
| 2010-03-03 | AAPL | 7.462143 | 7.495357 | 7.426429 | 7.476071 | 6.345663 | 372052800 | 0.013204 | 0.068205 | -0.055001 | 60.043600 | 7.174299 | 7.384214 | 7.594130 |
| 2010-03-04 | AAPL | 7.474286 | 7.532857 | 7.451071 | 7.525357 | 6.387496 | 366041200 | 0.030940 | 0.068753 | -0.037813 | 61.781887 | 7.300079 | 7.446429 | 7.592779 |
| 2010-03-05 | AAPL | 7.676429 | 7.846429 | 7.665357 | 7.819643 | 6.637286 | 899620400 | 0.067959 | 0.084617 | -0.016658 | 70.136225 | 7.273868 | 7.548786 | 7.823704 |
(146264, 14)
Stock                object
Open                float64
High                float64
Low                 float64
Close               float64
Adj Close           float64
Volume                int64
MACD                float64
MACD Signal         float64
MACD Histogram      float64
RSI                 float64
Bollinger Lower     float64
Bollinger Middle    float64
Bollinger Upper     float64
dtype: object
A function to scrape earnings data from Macrotrends(https://www.macrotrends.net/) for each stock. We can use Apple's quarterly revenue as a basis for long-term investment decisions.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
def fetch_macrotrends_earnings_data(ticker):
    url = f'https://www.macrotrends.net/stocks/charts/{ticker.lower()}/{ticker}/revenue'
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        print(f"Failed to retrieve the data for {ticker}")
        return None
    soup = BeautifulSoup(response.text, 'html.parser')
    # The second table on the page holds the quarterly revenue history
    table = soup.find_all('table')[1]
    earnings_data = pd.read_html(str(table))[0]
    return earnings_data
apple_earnings_data = fetch_macrotrends_earnings_data('AAPL')
display(apple_earnings_data.head(10))
print(apple_earnings_data.shape)
display(apple_earnings_data.dtypes)
| Apple Quarterly Revenue (Millions of US $) | Apple Quarterly Revenue (Millions of US $).1 | |
|---|---|---|
| 0 | 2023-06-30 | $81,797 |
| 1 | 2023-03-31 | $94,836 |
| 2 | 2022-12-31 | $117,154 |
| 3 | 2022-09-30 | $90,146 |
| 4 | 2022-06-30 | $82,959 |
| 5 | 2022-03-31 | $97,278 |
| 6 | 2021-12-31 | $123,945 |
| 7 | 2021-09-30 | $83,360 |
| 8 | 2021-06-30 | $81,434 |
| 9 | 2021-03-31 | $89,584 |
(58, 2)
Apple Quarterly Revenue (Millions of US $)      object
Apple Quarterly Revenue (Millions of US $).1    object
dtype: object
def clean_macrotrends_earnings_data(df, ticker):
    df.columns = ['Date', 'Revenue']
    df['Date'] = pd.to_datetime(df['Date'])
    # Use regex=False so '$' is treated literally, not as a regex end-anchor
    df['Revenue'] = df['Revenue'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)
    df['Stock'] = ticker
    return df
# Fetch and clean earnings data for all stocks
all_earnings_data = []
for ticker in tickers:
    raw_earnings_data = fetch_macrotrends_earnings_data(ticker)
    if raw_earnings_data is not None:
        cleaned_earnings_data = clean_macrotrends_earnings_data(raw_earnings_data, ticker)
        all_earnings_data.append(cleaned_earnings_data)
# Combine all earnings data into a single DataFrame
combined_earnings_data = pd.concat(all_earnings_data, axis=0)
table2 = combined_earnings_data.copy()
display(combined_earnings_data.head(10))
print(combined_earnings_data.shape)
display(combined_earnings_data.dtypes)
| Date | Revenue | Stock | |
|---|---|---|---|
| 0 | 2023-06-30 | 81797.0 | AAPL |
| 1 | 2023-03-31 | 94836.0 | AAPL |
| 2 | 2022-12-31 | 117154.0 | AAPL |
| 3 | 2022-09-30 | 90146.0 | AAPL |
| 4 | 2022-06-30 | 82959.0 | AAPL |
| 5 | 2022-03-31 | 97278.0 | AAPL |
| 6 | 2021-12-31 | 123945.0 | AAPL |
| 7 | 2021-09-30 | 83360.0 | AAPL |
| 8 | 2021-06-30 | 81434.0 | AAPL |
| 9 | 2021-03-31 | 89584.0 | AAPL |
(2707, 3)
Date       datetime64[ns]
Revenue           float64
Stock              object
dtype: object
We can see that Apple's quarterly revenue is the highest of the three (compared with MSFT and GOOGL), and that it increases year by year. We can infer that Apple's stock price should rise year by year, and that its growth rate may be higher than that of MSFT and GOOGL.
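The "increasing year by year" observation can be quantified as year-over-year growth of quarterly revenue. A sketch using the AAPL rows from the table above; `pct_change(periods=4)` compares each quarter with the same quarter one year earlier:

```python
import pandas as pd

# AAPL quarterly revenue rows from TABLE2 (millions of US $)
rev = pd.DataFrame({
    "Stock": ["AAPL"] * 8,
    "Date": pd.to_datetime([
        "2021-03-31", "2021-06-30", "2021-09-30", "2021-12-31",
        "2022-03-31", "2022-06-30", "2022-09-30", "2022-12-31",
    ]),
    "Revenue": [89584, 81434, 83360, 123945, 97278, 82959, 90146, 117154],
})

# Year-over-year growth: compare each quarter with the same quarter last year
rev = rev.sort_values("Date")
rev["YoY Growth %"] = rev.groupby("Stock")["Revenue"].pct_change(periods=4) * 100
print(rev[["Date", "Revenue", "YoY Growth %"]].round(1).tail(4))
```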
# Display basic statistics
print(combined_earnings_data.describe())
# Plotting the revenue over time for Apple
plt.figure(figsize=(10, 6))
apple_data = combined_earnings_data[combined_earnings_data['Stock'] == 'AAPL']
plt.plot(apple_data['Date'], apple_data['Revenue'], marker='o', linestyle='-', color='b')
plt.title('Apple Quarterly Revenue Over Time')
plt.xlabel('Date')
plt.ylabel('Revenue (in millions)')
plt.grid(True)
plt.show()
# Plotting the revenue distributions for Apple, Microsoft, and Google
selected_stocks = ['AAPL', 'MSFT', 'GOOGL']
selected_data = combined_earnings_data[combined_earnings_data['Stock'].isin(selected_stocks)]
plt.figure(figsize=(10, 6))
sns.boxplot(x='Stock', y='Revenue', data=selected_data)
plt.title('Revenue Distribution for Selected Stocks')
plt.xlabel('Stock')
plt.ylabel('Revenue (in millions)')
plt.show()
            Revenue
count   2679.000000
mean    7341.410228
std    14817.978635
min   -26397.000000
25%      705.000000
50%     2230.000000
75%     7286.500000
max   149204.000000
A function to scrape earnings data from Yahoo Finance(https://finance.yahoo.com/) for each stock. We can use Reported EPS as a basis for long-term investment decisions.
def fetch_yahoo_earnings_data(ticker):
    url = f'https://finance.yahoo.com/calendar/earnings?symbol={ticker}'
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        print(f"Failed to retrieve the data for {ticker}")
        return None
    soup = BeautifulSoup(response.text, 'html.parser')
    table = soup.find('table')
    earnings_data = pd.read_html(str(table))[0]
    return earnings_data
msft_yahoo_earnings_data = fetch_yahoo_earnings_data('MSFT')
display(msft_yahoo_earnings_data.head(10))
print(msft_yahoo_earnings_data.shape)
display(msft_yahoo_earnings_data.dtypes)
| Symbol | Company | Earnings Date | EPS Estimate | Reported EPS | Surprise(%) | |
|---|---|---|---|---|---|---|
| 0 | MSFT | Microsoft Corp | Oct 22, 2024, 6 AMEDT | - | - | - |
| 1 | MSFT | Microsoft Corp | Jul 23, 2024, 6 AMEDT | - | - | - |
| 2 | MSFT | Microsoft Corp | Apr 23, 2024, 6 AMEDT | - | - | - |
| 3 | MSFT | Microsoft Corp | Jan 22, 2024, 4 PMEST | 2.78 | - | - |
| 4 | MSFT | Microsoft Corp | Jan 22, 2024, 5 AMEST | 2.78 | - | - |
| 5 | MSFT | Microsoft Corporation | Oct 24, 2023, 12 PMEDT | 2.65 | 2.99 | +12.7 |
| 6 | MSFT | Microsoft Corporation | Jul 25, 2023, 12 PMEDT | 2.55 | 2.69 | +5.49 |
| 7 | MSFT | Microsoft Corporation | Apr 25, 2023, 12 PMEDT | 2.23 | 2.45 | +9.81 |
| 8 | MSFT | Microsoft Corporation | Jan 24, 2023, 11 AMEST | 2.29 | 2.32 | +1.09 |
| 9 | MSFT | Microsoft Corporation | Oct 25, 2022, 12 PMEDT | 2.3 | 2.35 | +2.05 |
(100, 6)
Symbol           object
Company          object
Earnings Date    object
EPS Estimate     object
Reported EPS     object
Surprise(%)      object
dtype: object
def parse_date(date_string):
    try:
        return parser.parse(date_string)
    except Exception:
        return None

def clean_yahoo_earnings_data(df, ticker):
    cleaned_data = []
    for index, row in df.iterrows():
        try:
            # Skip rows where any field is still a '-' placeholder
            if row['EPS Estimate'] == '-' or row['Reported EPS'] == '-' or row['Surprise(%)'] == '-':
                continue
            earnings_date = parse_date(row['Earnings Date'])
            # Placeholders were skipped above, so these casts are safe
            eps_estimate = float(row['EPS Estimate'])
            reported_eps = float(row['Reported EPS'])
            surprise = float(row['Surprise(%)'].replace('%', ''))
            cleaned_data.append([ticker, earnings_date, eps_estimate, reported_eps, surprise])
        except Exception:
            continue
    cleaned_df = pd.DataFrame(cleaned_data, columns=['Stock', 'Earnings Date', 'EPS Estimate', 'Reported EPS', 'Surprise(%)'])
    return cleaned_df
all_yahoo_earnings_data = []
for ticker in tickers:
    raw_yahoo_earnings_data = fetch_yahoo_earnings_data(ticker)
    if raw_yahoo_earnings_data is not None and not raw_yahoo_earnings_data.empty:
        cleaned_yahoo_earnings_data = clean_yahoo_earnings_data(raw_yahoo_earnings_data, ticker)
        if not cleaned_yahoo_earnings_data.empty:
            all_yahoo_earnings_data.append(cleaned_yahoo_earnings_data)
    # Pause between requests to avoid hammering the server
    time.sleep(1)
if all_yahoo_earnings_data:
    combined_yahoo_earnings_data = pd.concat(all_yahoo_earnings_data, ignore_index=True)
else:
    print("No data to display.")
table3 = combined_yahoo_earnings_data.copy()
display(combined_yahoo_earnings_data.head(10))
print(combined_yahoo_earnings_data.shape)
display(combined_yahoo_earnings_data.dtypes)
| Stock | Earnings Date | EPS Estimate | Reported EPS | Surprise(%) | |
|---|---|---|---|---|---|
| 0 | AAPL | 2023-11-02 12:00:00 | 1.39 | 1.46 | 4.89 |
| 1 | AAPL | 2023-08-03 12:00:00 | 1.19 | 1.26 | 5.49 |
| 2 | AAPL | 2023-05-04 12:00:00 | 1.43 | 1.52 | 6.03 |
| 3 | AAPL | 2023-02-02 11:00:00 | 1.94 | 1.88 | -2.88 |
| 4 | AAPL | 2022-10-27 12:00:00 | 1.27 | 1.29 | 1.55 |
| 5 | AAPL | 2022-07-28 12:00:00 | 1.16 | 1.20 | 3.25 |
| 6 | AAPL | 2022-04-28 12:00:00 | 1.43 | 1.52 | 6.44 |
| 7 | AAPL | 2022-01-27 11:00:00 | 1.89 | 2.10 | 11.17 |
| 8 | AAPL | 2021-10-28 12:00:00 | 1.24 | 1.24 | 0.30 |
| 9 | AAPL | 2021-07-27 12:00:00 | 1.01 | 1.30 | 29.12 |
(3431, 5)
Stock                    object
Earnings Date    datetime64[ns]
EPS Estimate            float64
Reported EPS            float64
Surprise(%)             float64
dtype: object
We can see that Reported EPS increases over the years, so we can infer that Apple's stock price should rise year by year. One interesting point: although Apple's quarterly revenue is higher than that of MSFT and GOOGL, Apple's Reported EPS is lower than MSFT's. So we will plot the actual stock movements in the final part to compare Apple and Microsoft.
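One way to make the AAPL-versus-MSFT comparison concrete is a per-stock EPS growth multiple (latest reported EPS over the earliest). The four rows below are taken from the earnings tables shown above; this is an illustrative sketch, not part of the original pipeline:

```python
import pandas as pd

# Reported EPS rows taken from the Yahoo Finance tables shown above
eps = pd.DataFrame({
    "Stock": ["AAPL", "AAPL", "MSFT", "MSFT"],
    "Earnings Date": pd.to_datetime(
        ["2022-10-27", "2023-11-02", "2022-10-25", "2023-10-24"]),
    "Reported EPS": [1.29, 1.46, 2.35, 2.99],
})

# Growth multiple per stock: latest reported EPS divided by the earliest
eps = eps.sort_values("Earnings Date")
growth = eps.groupby("Stock")["Reported EPS"].agg(lambda s: s.iloc[-1] / s.iloc[0])
print(growth.round(2))
```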
print(combined_yahoo_earnings_data.describe())
print(combined_yahoo_earnings_data.isnull().sum())
combined_yahoo_earnings_data.dropna(inplace=True)
# After fetching and cleaning the data, filter it to include only AAPL’s data
aapl_earnings_data = combined_yahoo_earnings_data[combined_yahoo_earnings_data['Stock'] == 'AAPL']
# plotting Reported EPS over time
plt.figure(figsize=(10, 6))
sns.lineplot(data=aapl_earnings_data, x='Earnings Date', y='Reported EPS')
plt.title('AAPL Reported EPS Over Time')
plt.show()
# plotting the Surprise(%) over time
plt.figure(figsize=(10, 6))
sns.lineplot(data=aapl_earnings_data, x='Earnings Date', y='Surprise(%)')
plt.title('AAPL Earnings Surprise(%) Over Time')
plt.show()
       EPS Estimate  Reported EPS  Surprise(%)
count   3431.000000   3431.000000  3431.000000
mean       0.736721      0.787753    13.914261
std        1.073613      1.122662    74.110469
min       -1.580000     -1.910000 -1616.790000
25%        0.150000      0.170000     1.935000
50%        0.430000      0.460000     6.080000
75%        0.935000      1.000000    13.830000
max       10.420000     10.710000  1283.470000
Stock              0
Earnings Date    268
EPS Estimate       0
Reported EPS       0
Surprise(%)        0
dtype: int64
# Compare AAPL's Reported EPS with other companies
others_earnings_data = combined_yahoo_earnings_data[combined_yahoo_earnings_data['Stock'] == 'MSFT']
google_earnings_data = combined_yahoo_earnings_data[combined_yahoo_earnings_data['Stock'] == 'GOOGL']
plt.figure(figsize=(10, 6))
sns.lineplot(data=aapl_earnings_data, x='Earnings Date', y='Reported EPS', marker="o", label='AAPL')
sns.lineplot(data=others_earnings_data, x='Earnings Date', y='Reported EPS', marker="o", label='MSFT')
sns.lineplot(data=google_earnings_data, x='Earnings Date', y='Reported EPS', marker="o", label='GOOGL')
plt.title('Comparing AAPL Reported EPS with Other Companies')
plt.grid(True)
plt.show()
We compare the stock price variation between AAPL and MSFT. The interesting thing is that while Apple's stock price has grown more in absolute terms, Microsoft's has risen by a larger multiple. From this analysis, Reported EPS seems to be the more important feature.
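The "larger multiple" comparison is easiest to see after normalizing each close series to its first value. A minimal sketch on hypothetical prices (not the real AAPL/MSFT history):

```python
import pandas as pd

# Hypothetical close series standing in for the AAPL and MSFT history
idx = pd.date_range("2010-01-04", periods=5, freq="B")
aapl = pd.Series([7.6, 8.0, 9.0, 10.0, 11.4], index=idx)
msft = pd.Series([30.0, 31.0, 40.0, 50.0, 60.0], index=idx)

# Normalize each series to 1.0 at the start so growth multiples are comparable
aapl_norm = aapl / aapl.iloc[0]
msft_norm = msft / msft.iloc[0]
print(round(aapl_norm.iloc[-1], 2), round(msft_norm.iloc[-1], 2))  # prints: 1.5 2.0
```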
# Extract stock data with AAPL and MSFT
stock_data = fetch_and_enhance_stock_data(['AAPL', 'MSFT'])
aapl_data = stock_data[stock_data['Stock'] == 'AAPL']
msft_data = stock_data[stock_data['Stock'] == 'MSFT']
# Plotting the data
plt.figure(figsize=(10,6))
sns.lineplot(data=aapl_data, x=aapl_data.index, y='Close', label='AAPL')
sns.lineplot(data=msft_data, x=msft_data.index, y='Close', label='MSFT')
plt.title('Stock Price Variation Over Years between AAPL and MSFT')
plt.ylabel('Stock Price')
plt.xlabel('Date')
plt.grid(True)
plt.legend()
plt.show()
[*********************100%%**********************] 1 of 1 completed [*********************100%%**********************] 1 of 1 completed
We summarize our findings, insights, and observations from the data analysis and visualizations; these can be expanded with further analysis and interpretation. For now we focus on Reported EPS; in the future we will consider more features and use our reinforcement learning model to compare the results across different features.
I combine all three datasets (Table1: stock_data, Table2: combined_earnings_data, and Table3: combined_yahoo_earnings_data) and conduct a deep EDA to uncover relationships between important features. This will inform the development of a future stock recommendation model, whether it's based on Reinforcement Learning (RL) or classic prediction models.
First, we'll merge the datasets on the common columns (Date and Stock). Since Table1 and Table3 have a one-to-one relationship on the earnings release date, we can merge them directly. Table2 is then joined on the nearest date, since earnings data is quarterly while stock data is daily.
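The nearest-date join for the quarterly table can be sketched with `pd.merge_asof` on toy frames (the AAPL revenue figures are from TABLE2 above; the daily close values are hypothetical):

```python
import pandas as pd

# Toy daily stock rows; close values are hypothetical
daily = pd.DataFrame({
    "Date": pd.to_datetime(["2022-06-29", "2022-06-30", "2022-07-01"]),
    "Stock": ["AAPL"] * 3,
    "Close": [139.2, 136.7, 138.9],
})
# AAPL quarterly revenue rows from TABLE2 (millions of US $)
quarterly = pd.DataFrame({
    "Date": pd.to_datetime(["2022-03-31", "2022-06-30"]),
    "Stock": ["AAPL", "AAPL"],
    "Revenue": [97278.0, 82959.0],
})

# Each daily row takes the revenue row whose date is nearest to it
merged = pd.merge_asof(
    daily.sort_values("Date"),
    quarterly.sort_values("Date"),
    on="Date", by="Stock", direction="nearest",
)
print(merged["Revenue"].tolist())  # [82959.0, 82959.0, 82959.0]
```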
table1_final = table1.copy()
table2_final = table2.copy()
table3_final = table3.copy()
# Reset index for table1_final
table1_final.reset_index(inplace=True)
table1_final['Date'] = pd.to_datetime(table1_final['Date']).dt.date
# Convert 'Earnings Date' to datetime format
table3_final['Earnings Date'] = pd.to_datetime(table3_final['Earnings Date'])
# Extract just the date part from 'Earnings Date'
table3_final['Earnings Date'] = table3_final['Earnings Date'].dt.date
table3_final.rename(columns={'Earnings Date': 'Date'}, inplace=True)
After merging, the 'EPS Estimate', 'Reported EPS', 'Surprise(%)', and 'Revenue' columns contain NaN values, since EPS is reported quarterly while stock data is daily. We choose to drop the NaN rows because, first, we do not know how EPS truly varies between quarters, and second, enough data remains after dropping them.
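An alternative to dropping the NaN rows, sketched below on toy values, is to carry each quarterly figure forward until the next report. This assumes the last reported figure stays relevant between reports, which is a modeling choice rather than something the data establishes:

```python
import numpy as np
import pandas as pd

# Daily rows where the quarterly field is only populated on report dates
df = pd.DataFrame({
    "Stock": ["AAPL"] * 4,
    "Reported EPS": [1.52, np.nan, np.nan, 1.20],
})

# Alternative to dropna: carry the last reported figure forward per stock
df["Reported EPS"] = df.groupby("Stock")["Reported EPS"].ffill()
print(df["Reported EPS"].tolist())  # [1.52, 1.52, 1.52, 1.2]
```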
# Merge table1_final with table3_final on the Date
table1_final['Date'] = pd.to_datetime(table1_final['Date'])
table3_final['Date'] = pd.to_datetime(table3_final['Date'])
merged_data = pd.merge(table1_final, table3_final, how='left', left_on=['Date', 'Stock'], right_on=['Date', 'Stock'])
# Merge merged_data with table2_final on the nearest Date
table2_final['Date'] = pd.to_datetime(table2_final['Date'])
merged_data['Date'] = pd.to_datetime(merged_data['Date'])
merged_data = pd.merge_asof(merged_data.sort_values('Date'), table2_final.sort_values('Date'), by='Stock', on='Date', direction='nearest')
# Move 'Stock' column to the first position
cols = ['Stock'] + [col for col in merged_data.columns if col != 'Stock']
merged_data = merged_data[cols]
# Drop Nan
merged_data.dropna(subset=['EPS Estimate', 'Reported EPS', 'Surprise(%)', 'Revenue'], inplace=True)
merged_data.reset_index(drop=True, inplace=True)
# Display merged dataset
display(merged_data.head())
print(merged_data.shape)
display(merged_data.dtypes)
| Stock | Date | Open | High | Low | Close | Adj Close | Volume | MACD | MACD Signal | MACD Histogram | RSI | Bollinger Lower | Bollinger Middle | Bollinger Upper | EPS Estimate | Reported EPS | Surprise(%) | Revenue | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CRM | 2010-02-24 | 17.080000 | 17.504999 | 17.022499 | 17.360001 | 17.360001 | 10460400 | -0.075746 | 0.271555 | -0.347302 | 59.118764 | 16.990480 | 17.204500 | 17.418519 | 0.04 | 0.04 | 6.89 | 354.0 |
| 1 | AVGO | 2010-02-24 | 17.500000 | 17.870001 | 17.410000 | 17.740000 | 12.838190 | 852300 | -0.254461 | 0.146585 | -0.401046 | 52.805132 | 17.210859 | 17.490000 | 17.769142 | 0.30 | 0.34 | 14.09 | 456.0 |
| 2 | ANSS | 2010-02-25 | 43.099998 | 43.639999 | 42.029999 | 43.619999 | 43.619999 | 462800 | 0.160317 | 0.380058 | -0.219741 | 58.903683 | 42.637163 | 43.400000 | 44.162837 | 0.49 | 0.53 | 8.38 | 136.0 |
| 3 | ADBE | 2010-03-23 | 35.259998 | 35.419998 | 34.689999 | 35.220001 | 35.220001 | 8504500 | 0.240266 | 0.015448 | 0.224818 | 55.163065 | 34.452261 | 35.216000 | 35.979740 | 0.37 | 0.40 | 7.34 | 859.0 |
| 4 | ACN | 2010-03-25 | 42.099998 | 42.480000 | 41.500000 | 41.520000 | 32.030968 | 4643600 | 0.221055 | 0.011567 | 0.209488 | 48.319988 | 41.338574 | 42.040001 | 42.741428 | 0.61 | 0.60 | -2.17 | 5538.0 |
(1866, 19)
Stock                    object
Date             datetime64[ns]
Open                    float64
High                    float64
Low                     float64
Close                   float64
Adj Close               float64
Volume                    int64
MACD                    float64
MACD Signal             float64
MACD Histogram          float64
RSI                     float64
Bollinger Lower         float64
Bollinger Middle        float64
Bollinger Upper         float64
EPS Estimate            float64
Reported EPS            float64
Surprise(%)             float64
dtype: object
# Summary Statistics
print(merged_data.describe())
print(merged_data['Stock'].value_counts())
Open High Low Close Adj Close \
count 1866.000000 1866.000000 1866.000000 1866.000000 1866.000000
mean 87.573812 88.862316 86.207862 87.596345 82.460398
std 96.143209 97.640704 94.534965 96.116943 94.378782
min 1.534000 1.544667 1.504000 1.522667 1.522667
25% 27.435000 27.799999 26.800001 27.382751 23.735375
50% 54.294498 55.224998 53.271999 54.395000 49.689671
75% 113.280001 114.735001 111.997498 113.344999 104.345003
max 675.700012 684.830017 663.609985 664.760010 664.760010
Volume MACD MACD Signal MACD Histogram RSI \
count 1.866000e+03 1866.000000 1866.000000 1866.000000 1866.000000
mean 3.679789e+07 0.450076 0.027224 0.422851 54.458778
std 9.468753e+07 3.164273 0.930740 2.976642 11.593190
min 8.830000e+04 -35.798676 -12.289617 -30.238577 12.441018
25% 3.269575e+06 -0.179459 -0.102475 -0.196597 46.566632
50% 9.460000e+06 0.255798 0.025647 0.251278 55.009380
75% 3.531965e+07 1.034519 0.214732 0.936849 62.957277
max 1.880998e+09 20.520009 6.946169 22.839408 83.747668
Bollinger Lower Bollinger Middle Bollinger Upper EPS Estimate \
count 1866.000000 1866.000000 1866.000000 1866.000000
mean 84.660355 87.482621 90.304887 0.905418
std 93.020314 96.339188 99.732342 1.147128
min 1.449265 1.541333 1.561554 -0.500000
25% 26.486310 27.268725 28.109967 0.290000
50% 52.364663 54.250000 55.958033 0.600000
75% 109.404774 112.955000 116.540095 1.080000
max 662.692464 680.662012 698.631559 10.280000
Reported EPS Surprise(%) Revenue
count 1866.000000 1866.000000 1866.000000
mean 0.968617 13.986683 8008.525188
std 1.194024 74.843757 15706.062694
min -0.870000 -1616.790000 -26397.000000
25% 0.320000 2.012500 709.250000
50% 0.640000 6.220000 1974.000000
75% 1.170000 14.057500 7486.750000
max 10.450000 1283.470000 137412.000000
ORCL 52
MU 52
CRM 51
QCOM 51
AKAM 51
NVDA 51
AMAT 51
INTC 51
TXN 51
GOOGL 51
MSFT 51
AAPL 51
LRCX 51
HPQ 50
AVGO 50
KLAC 50
IBM 50
ADBE 49
FTNT 49
INTU 49
CDNS 49
SNPS 48
SWKS 48
VMW 48
VRSN 47
AMD 46
MCHP 44
AMZN 44
TSLA 42
META 42
ANSS 40
NOW 39
WDAY 39
CTSH 33
KEYS 32
HPE 28
PYPL 28
SQ 26
ADI 22
GLW 18
ACN 18
FIS 18
TTD 18
DOCU 17
EPAM 12
NET 8
Name: Stock, dtype: int64
Stock Price (Open, High, Low, Close, Adj Close): The count is consistent across these variables, indicating no missing values. The mean, median (50%), and standard deviation (std) give a sense of the central tendency and spread of the data. The average closing price is about 87.60, but the standard deviation is 96.12, suggesting that prices vary widely across stocks.
Volume: The trading volume also varies greatly (std is much higher than the mean), and there's a huge range (min to max), indicating that some stocks are traded much more frequently than others.
Technical Indicators (MACD, RSI, Bollinger Bands): These are typical indicators used in stock price analysis. The RSI, for example, ranges from 12.44 to 83.74, with a mean of 54.46, suggesting that on average, stocks are neither overbought nor oversold.
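To make the RSI interpretation concrete, here is a minimal, simplified Wilder-style RSI sketch on synthetic prices. The `rsi` helper and the price series are illustrative stand-ins for the `pandas_ta` call used on the real data, not the actual dataset:

```python
import numpy as np
import pandas as pd

def rsi(close: pd.Series, length: int = 14) -> pd.Series:
    """Wilder-style RSI; a simplified stand-in for pandas_ta.rsi."""
    delta = close.diff()
    gain = delta.clip(lower=0).ewm(alpha=1 / length, min_periods=length).mean()
    loss = (-delta.clip(upper=0)).ewm(alpha=1 / length, min_periods=length).mean()
    rs = gain / loss
    return 100 - 100 / (1 + rs)

up = pd.Series(np.linspace(100, 120, 40))    # steady uptrend -> RSI near 100 (overbought)
down = pd.Series(np.linspace(120, 100, 40))  # steady downtrend -> RSI near 0 (oversold)
print(rsi(up).iloc[-1], rsi(down).iloc[-1])
```

On our data the RSI mean of 54.46 sits between these extremes, consistent with stocks being neither overbought nor oversold on average.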
EPS and Revenue: There's a lot of variability in EPS (Earnings Per Share) and revenue as well. The negative minimum values for Reported EPS and Revenue indicate losses for some quarters.
Surprise(%): This shows how much the actual earnings differed from the estimates. The huge range (from -1616.79% to 1283.47%) indicates some significant surprises in earnings reports.
Looking at the per-stock counts, some stocks such as 'NET' and 'EPAM' have fewer data points, which could be because they were listed on the stock exchange more recently than the others.
Based on the above analysis, the next steps in the EDA process will focus on visualizing and understanding the relationships and patterns in the data.
The statistics above show that the variance is very large. The figure below shows the top 10 stocks by variance.
import numpy as np
# Calculate variance for each stock
stock_variance = merged_data.groupby('Stock')['Close'].var()
# Sort stocks by variance in descending order and pick top 10
top_10_stocks = stock_variance.sort_values(ascending=False).head(10).index
import matplotlib.pyplot as plt  # may already be imported in an earlier cell

plt.figure(figsize=(15, 6))
# Plot only the top 10 stocks with the highest variance
for stock in top_10_stocks:
    subset = merged_data[merged_data['Stock'] == stock]
    plt.plot(subset['Date'], subset['Close'], label=stock)
plt.xlabel('Date')
plt.ylabel('Close Price')
plt.title('Stock Prices Over Time for Top 10 Stocks by Variance')
plt.legend(loc='upper left', bbox_to_anchor=(1, 1))
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
From the above results, we can see that even well-known companies such as TSLA and META have very large variance. In addition, after 2022 some stocks decline sharply while others, such as AVGO, decline only slightly, which underscores the importance of a trading strategy and the need for stock recommendation.
Visualize the distribution of stock prices to understand their range and common values.
plt.figure(figsize=(10, 6))
sns.histplot(data=merged_data, x='Close', kde=True)
plt.title('Distribution of Closing Prices')
plt.show()
From the above results, we can see that the right-skewed distribution of stock prices in our dataset means most stocks cluster at lower prices, with a few outliers at significantly higher prices. The concentration at lower price points may represent more accessible investment opportunities for a broader range of investors, while the higher-priced outliers might suit investors with a larger capital base. We will do a volume analysis to verify these findings.
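The right skew can be quantified rather than just eyeballed. A hedged sketch on synthetic lognormal "prices" (a stand-in for the actual `Close` column) shows a strongly positive skewness that a log transform largely removes:

```python
import numpy as np
from scipy.stats import skew

# Synthetic right-skewed "prices": a lognormal sample stands in for merged_data['Close']
rng = np.random.default_rng(42)
prices = rng.lognormal(mean=4.0, sigma=1.0, size=2000)

raw_skew = skew(prices)          # strongly positive for a right-skewed sample
log_skew = skew(np.log(prices))  # near zero: the log transform symmetrizes the data
print(f"skew(raw)={raw_skew:.2f}, skew(log)={log_skew:.2f}")
```

If modeling later assumes roughly symmetric inputs, applying the same log transform to the real price column is one option worth comparing.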
Explore the relationship between the volume traded and the stock price, as high trading volumes can often lead to significant price movements.
plt.figure(figsize=(10, 6))
sns.regplot(x='Volume', y='Close', data=merged_data, scatter_kws={'alpha':0.3}, lowess=True)
# Setting the x-axis to a logarithmic scale for better visualization
plt.xscale('log')
plt.title('Volume Traded vs. Closing Price')
plt.xlabel('Volume Traded')
plt.ylabel('Closing Price')
plt.show()
The plot shows a concentration of data points towards the lower end of the y-axis (closing price) and a wide spread along the x-axis (volume traded). This pattern suggests that stocks with lower closing prices tend to have a broader range of trading volumes, while those with higher closing prices show relatively less variation in volume. From this analysis, we consider volume and historical stock prices to be important features for future recommendation: lower-priced stocks with high volumes tend to trade more frequently.
Explore the relationship between the MACD or RSI and the stock price, as these are important indicators that can often signal significant price movements.
# Plotting MACD vs. Stock Price
plt.figure(figsize=(14, 6))
plt.subplot(1, 2, 1)
sns.regplot(x='MACD', y='Close', data=merged_data, lowess=True, scatter_kws={'alpha':0.3})
plt.title('MACD vs. Closing Price')
plt.xlabel('MACD')
plt.ylabel('Closing Price')
# Plotting RSI vs. Stock Price
plt.subplot(1, 2, 2)
sns.regplot(x='RSI', y='Close', data=merged_data, lowess=True, scatter_kws={'alpha':0.3})
plt.title('RSI vs. Closing Price')
plt.xlabel('RSI')
plt.ylabel('Closing Price')
plt.tight_layout()
plt.show()
From the above plots, the MACD (Moving Average Convergence Divergence) plot shows an inverted-V pattern centered around the zero line. When the MACD crosses above zero, it often signals increasing positive momentum (a bullish trend), and stocks may see rising prices. This information can be crucial for a stock recommendation system.
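To illustrate why a positive MACD accompanies an uptrend, here is a minimal MACD sketch on a synthetic price series. The `macd` helper and the series are illustrative stand-ins for `pandas_ta.macd`, not the actual pipeline:

```python
import numpy as np
import pandas as pd

def macd(close, fast=12, slow=26, signal=9):
    """Standard MACD (fast EMA minus slow EMA) with a signal line; a simplified stand-in for pandas_ta.macd."""
    ema_fast = close.ewm(span=fast, adjust=False).mean()
    ema_slow = close.ewm(span=slow, adjust=False).mean()
    line = ema_fast - ema_slow
    sig = line.ewm(span=signal, adjust=False).mean()
    return line, sig, line - sig

# Flat prices followed by a sustained uptrend: the MACD line crosses above zero during the rise
prices = pd.Series(np.concatenate([np.full(30, 100.0), np.linspace(100.0, 130.0, 30)]))
line, sig, hist = macd(prices)
print(f"MACD at end of uptrend: {line.iloc[-1]:.3f}")
```

The fast EMA reacts to the rise sooner than the slow EMA, so their difference (the MACD line) turns positive, which is exactly the bullish crossover described above.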
RSI Analysis: The RSI (Relative Strength Index) plot appears as a nearly flat line, suggesting there is no strong, clear relationship between RSI values and stock prices. This might indicate that RSI alone is not a sufficient predictor of stock price movement in this dataset; it may need to be combined with other indicators or used in specific contexts to provide meaningful insights for stock recommendations. During feature engineering I will try both approaches, excluding RSI and combining it with other features, and compare the results.
Explore the relationship between the Reported EPS or Surprise(%) or Revenue and the stock price.
# Plotting Reported EPS vs. Stock Price
plt.figure(figsize=(18, 6))
plt.subplot(1, 3, 1)
sns.scatterplot(x='Reported EPS', y='Close', data=merged_data, alpha=0.3)
sns.regplot(x='Reported EPS', y='Close', data=merged_data, lowess=True, scatter=False, color='red')
plt.title('Reported EPS vs. Closing Price')
plt.xlabel('Reported EPS')
plt.ylabel('Closing Price')
# Plotting Surprise(%) vs. Stock Price
plt.subplot(1, 3, 2)
sns.scatterplot(x='Surprise(%)', y='Close', data=merged_data, alpha=0.3)
sns.regplot(x='Surprise(%)', y='Close', data=merged_data, lowess=True, scatter=False, color='red')
plt.title('Surprise(%) vs. Closing Price')
plt.xlabel('Surprise(%)')
plt.ylabel('Closing Price')
# Plotting Revenue vs. Stock Price
plt.subplot(1, 3, 3)
sns.scatterplot(x='Revenue', y='Close', data=merged_data, alpha=0.3)
sns.regplot(x='Revenue', y='Close', data=merged_data, lowess=True, scatter=False, color='red')
plt.title('Revenue vs. Closing Price')
plt.xlabel('Revenue')
plt.ylabel('Closing Price')
plt.tight_layout()
plt.show()
From above plots we can know Reported EPS vs. Stock Price: The plot for Reported EPS (Earnings Per Share) versus Stock Price reveals a strong positive relationship. This indicates that stocks with higher earnings per share tend to have higher stock prices. This is consistent with fundamental analysis principles, as EPS is a direct measure of a company's profitability on a per-share basis. A higher EPS typically signals better financial health and thus a potentially more valuable investment.
Although this feature is the strongest correlation, we still see many points deviating from the correlation line, which proves that the stock price is not only affected by Reported EPS, but also necessary to analyze other features.
Surprise(%) vs. Stock Price: The relationship between Surprise(%) and Stock Price is gently positive. Surprise(%) measures the difference between actual earnings and analyst estimates. A positive surprise indicates that the company performed better than expected, which can lead to a positive stock price reaction. However, the gentle slope suggests that while earnings surprises can influence stock prices, other factors are also at play, and the impact may not be as pronounced or consistent.
Revenue vs. Stock Price: The plot for Revenue versus Stock Price also shows a gently positive trend. This indicates that companies with higher revenues tend to have higher stock prices, but the relationship is not as strong as with EPS. This makes sense because while revenue is an essential indicator of company size and market presence, it does not account for costs and expenses. Therefore, it's not as strong a profitability indicator as EPS.
Explore the relationship between different features and conclude the final feature table.
features_subset = merged_data[['Open', 'High', 'Low', 'Close', 'Volume', 'MACD', 'RSI', 'Bollinger Upper', 'Bollinger Lower', 'EPS Estimate', 'Reported EPS', 'Surprise(%)', 'Revenue']]
# Pair plot
sns.pairplot(features_subset)
plt.show()
# Selecting relevant financial indicators
financial_indicators = ['Open', 'High', 'Low', 'Close', 'Volume', 'MACD', 'RSI', 'Bollinger Upper', 'Bollinger Lower', 'EPS Estimate', 'Reported EPS', 'Surprise(%)', 'Revenue']
# Creating the correlation matrix
corr_matrix = merged_data[financial_indicators].corr()
# Plotting the correlation matrix
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix of Financial Indicators')
plt.show()
From the above results (scatter matrix and correlation matrix), several features (Open, High, Low, Close, Bollinger Upper, and Bollinger Lower) show a nearly perfect linear relationship with a slope of 1, implying that they are highly correlated and that changes in one feature are mirrored almost exactly by the others.
This means these features carry very similar information for predictive modeling. Since they are perfectly or almost perfectly correlated, I will use just one of them for future predictions to avoid redundancy.
Among Open, High, Low, Close, Bollinger Upper, and Bollinger Lower, I choose to keep only the 'Close' price in my model, since it encapsulates the information provided by the other variables.
In addition, from the previous analysis and the scatter matrix, I select Volume, MACD, Reported EPS, Surprise(%), Revenue, and historical stock prices as the feature table for future prediction, because they are all related to the closing price, with Reported EPS showing the strongest relationship. RSI is the feature we are unsure about, so I will compare results with and without it in the feature table. From the correlation matrix, I also find that Surprise(%) and Revenue are only weakly related to the closing price, so during feature engineering I will likewise compare results with and without them.
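The planned with/without-RSI comparison can be sketched as follows. This is a hedged illustration on synthetic data (columns 0-2 mimic Volume, MACD, and Reported EPS; column 3 is a pure-noise stand-in for RSI), not the actual merged_data pipeline:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical data: target driven mostly by the "EPS" column, plus a noise "RSI" column
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 4))
y = 2.0 * X[:, 2] + 0.5 * X[:, 0] + rng.normal(scale=0.1, size=n)

def rmse_for(cols):
    """Fit a Random Forest on the chosen columns and return the test RMSE."""
    X_tr, X_te, y_tr, y_te = train_test_split(X[:, cols], y, test_size=0.2, random_state=42)
    model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)
    return np.sqrt(mean_squared_error(y_te, model.predict(X_te)))

rmse_with_rsi = rmse_for([0, 1, 2, 3])      # feature table including "RSI"
rmse_without_rsi = rmse_for([0, 1, 2])      # feature table excluding "RSI"
print(f"RMSE with RSI: {rmse_with_rsi:.3f}, without: {rmse_without_rsi:.3f}")
```

The same two-RMSE comparison, run on the real feature table, would tell us whether including RSI (and likewise Surprise(%) and Revenue) helps or merely adds noise.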
Based on my previous analyses and feature selection, I can frame several distinct model questions that leverage different modeling approaches:
I plan to utilize historical stock data and financial indicators to predict future closing prices. The independent variables will include 'Volume', 'MACD', 'Reported EPS', 'Surprise(%)', 'Revenue', and historical stock prices; I will experiment with including and excluding 'RSI' to assess its impact on predictive performance. The dependent variable will be the stock's 'Close' price for the next quarter. I will employ a regression-based supervised learning model, such as a Random Forest Regressor, to capture the non-linear relationships observed in our exploratory data analysis, which revealed strong correlations between these features and the closing price. I will split the data into training and testing sets to evaluate the model's performance and ensure it generalizes well to unseen data.
I plan to deploy a Contextual Bandits model to optimize trading decisions in the stock market. State, Actions, and Reward are defined as follows: State (Context): the stock features 'Close' price, 'Volume', 'MACD', 'Reported EPS', 'Surprise(%)', and 'Revenue'.
Actions: "buy", "sell", or "hold".
Reward: The reward function is the immediate profit or loss from an action. For example, if the agent decides to buy, and the stock price increases in the next time step, the reward is positive. Conversely, if the price decreases, the reward is negative.
Our EDA supports this approach, as it revealed strong correlations between these features and the closing price. I will split the data into training and testing sets, train the model on the training data, and evaluate its performance on the testing data.
A Deep Reinforcement Learning (DRL) model may be a better model to consider. The state space will consist of features such as the 'Close' price, 'Volume', 'MACD', 'Reported EPS', 'Surprise(%)', and 'Revenue', alongside the agent's portfolio state. The action space will include the potential trading actions: buy, sell, or hold. The reward function will be designed to maximize portfolio value over time, encouraging the agent to learn profitable trading strategies. The EDA has demonstrated the potential of these features to inform stock price movements, suggesting that a DRL agent could leverage these relationships to make informed trading decisions. I will split the data into training and testing sets, train the model on the historical training data, and evaluate its performance on the testing data.
These models will enable us to address different aspects of stock market prediction and trading optimization. The supervised learning model will focus on predicting future prices, while the RL model will aim at making profitable trading decisions.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
features = ['Volume', 'MACD', 'Reported EPS', 'Surprise(%)', 'Revenue']
target = 'Close'
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(merged_data[features], merged_data[target], test_size=0.2, random_state=42)
# Initialize the Random Forest Regressor
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)
# Fit the model
rf_regressor.fit(X_train, y_train)
# Make predictions
predictions = rf_regressor.predict(X_test)
# Calculate the mean squared error
mse = mean_squared_error(y_test, predictions)
rmse = np.sqrt(mse)
print(f"Root Mean Squared Error: {rmse}")
Root Mean Squared Error: 39.561926741711375
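A natural follow-up to the RMSE is inspecting which features the Random Forest actually relies on. This is a hedged sketch on synthetic stand-in data (with the real merged_data you would read `rf_regressor.feature_importances_` directly); the data here is hypothetical, constructed so the "Reported EPS" column dominates, echoing the EDA finding:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical stand-in data, not the actual merged_data
rng = np.random.default_rng(1)
feature_names = ['Volume', 'MACD', 'Reported EPS', 'Surprise(%)', 'Revenue']
X = rng.normal(size=(400, 5))
y = 3.0 * X[:, 2] + 0.3 * X[:, 4] + rng.normal(scale=0.1, size=400)

rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)
# Importances sum to 1; the dominant driver should rank first
for name, imp in sorted(zip(feature_names, rf.feature_importances_), key=lambda t: -t[1]):
    print(f"{name:14s} {imp:.3f}")
```

On the real data, a similar ranking would confirm (or challenge) the conclusion that Reported EPS is the strongest determinant of the closing price.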
# Actual vs Predicted plot
plt.figure(figsize=(12,6))
plt.scatter(y_test, predictions, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=4)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Actual vs. Predicted Close Prices')
plt.show()
# Plot of residuals
residuals = y_test - predictions
plt.figure(figsize=(12,6))
plt.scatter(predictions, residuals, alpha=0.5)
plt.title('Residuals vs Predicted')
plt.xlabel('Predicted')
plt.ylabel('Residuals')
plt.axhline(y=0, color='red', linestyle='--')
plt.show()
The above shows the preliminary results of the Random Forest Regressor. As the Actual vs. Predicted Close Prices figure shows, most points lie close to the black line, but some are still far from it. In the future, we will do more feature engineering to improve the performance.
Contextual bandit is a reinforcement learning approach. We build a contextual bandit model to optimize trading decisions in the stock market. State, Actions, and Reward are defined as follows: State (Context): the stock features 'Close' price, 'Volume', 'MACD', 'Reported EPS', 'Surprise(%)', and 'Revenue'.
Actions (Arms): "buy", "sell", or "hold".
Reward: we add a position constraint, where position tracks the current stock holdings (positive for held stocks, negative if stocks have been sold), and trade_size is the number of stock units to buy or sell in each action (default 1). The reward is the calculated profit or loss, returned along with the updated position.
The following is the preliminary code for Contextual Bandits.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Normalize the features
scaler = StandardScaler()
feature_columns = ['Close', 'Volume', 'MACD', 'Reported EPS', 'Surprise(%)', 'Revenue']
merged_data[feature_columns] = scaler.fit_transform(merged_data[feature_columns])
# Split the dataset into train and test sets
train_data, test_data = train_test_split(merged_data, test_size=0.2, random_state=42)
class LinUCB:
    """Linear UCB contextual bandit: one ridge-regression model per arm."""
    def __init__(self, alpha, n_arms, n_features):
        self.alpha = alpha          # exploration strength
        self.n_arms = n_arms
        self.n_features = n_features
        self.A = [np.identity(n_features) for _ in range(n_arms)]  # per-arm Gram matrices
        self.b = [np.zeros(n_features) for _ in range(n_arms)]     # per-arm reward vectors

    def recommend_arm(self, x):
        # UCB score per arm: theta^T x + alpha * sqrt(x^T A^-1 x)
        p = np.zeros(self.n_arms)
        for arm in range(self.n_arms):
            A_inv = np.linalg.inv(self.A[arm])
            theta = A_inv @ self.b[arm]
            p[arm] = theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)
        return np.argmax(p)

    def update(self, chosen_arm, x, reward):
        # Rank-one update; np.outer is required because x is 1-D (x @ x.T would collapse to a scalar)
        self.A[chosen_arm] += np.outer(x, x)
        self.b[chosen_arm] += reward * x
def get_reward(chosen_arm, current_price, next_price, positions, trade_size=1):
    # Calculate the reward based on the action taken
    if chosen_arm == 1:  # buy
        if positions <= 0:
            profit_loss = (next_price - current_price) * trade_size
            positions += trade_size  # Update the position to reflect the purchase
        else:
            profit_loss = 0
    elif chosen_arm == 2:  # sell
        if positions >= 0:
            profit_loss = (current_price - next_price) * trade_size
            positions -= trade_size  # Update the position to reflect the sale
        else:
            profit_loss = 0
    else:  # hold
        profit_loss = 0  # No profit or loss if holding
    return profit_loss, positions
# Initialize positions and bandit algorithm
alpha = 1 # controls exploration-exploitation trade-off
n_arms = 3 # Number of arms: buy, sell, hold
n_features = len(feature_columns) # Number of features
bandit = LinUCB(alpha, n_arms, n_features)
# Training loop for the bandit on the training set
train_rewards = []
positions = 0
train_data = train_data.reset_index(drop=True)  # positional index so that iloc[index + 1] really is the next row
for index, row in train_data.iterrows():
    if index + 1 < len(train_data):
        # Extract the current and next price along with the other features
        current_features = np.array(row[feature_columns].values, dtype=np.float64)
        current_price = row['Close']
        next_price = train_data.iloc[index + 1]['Close']
        chosen_arm = bandit.recommend_arm(current_features)
        reward, positions = get_reward(chosen_arm, current_price, next_price, positions)
        # Update the bandit
        bandit.update(chosen_arm, current_features, reward)
        # Keep track of the rewards
        train_rewards.append(reward)
# Evaluating on the test set
test_rewards = []
positions = 0
test_data = test_data.reset_index(drop=True)  # positional index so that iloc[index + 1] really is the next row
for index, row in test_data.iterrows():
    if index + 1 < len(test_data):
        # Extract the current and next price along with the other features
        current_features = np.array(row[feature_columns].values, dtype=np.float64)
        current_price = row['Close']
        next_price = test_data.iloc[index + 1]['Close']
        chosen_arm = bandit.recommend_arm(current_features)
        reward, positions = get_reward(chosen_arm, current_price, next_price, positions)
        test_rewards.append(reward)
    else:
        # We cannot compute a reward for the last entry in the test set
        test_rewards.append(0)
# Compute cumulative rewards for plotting
cumulative_train_rewards = np.cumsum(train_rewards)
cumulative_test_rewards = np.cumsum(test_rewards)
# Plot the cumulative rewards for the training set
plt.figure(figsize=(14, 7))
plt.plot(cumulative_train_rewards, label='Train Cumulative Reward')
plt.xlabel('Time step')
plt.ylabel('Cumulative Reward')
plt.title('Cumulative Reward Over Training')
plt.legend()
plt.show()
# Plot the cumulative rewards for the test set
plt.figure(figsize=(14, 7))
plt.plot(cumulative_test_rewards, label='Test Cumulative Reward', color='orange')
plt.xlabel('Time step')
plt.ylabel('Cumulative Reward')
plt.title('Cumulative Reward Over Testing')
plt.legend()
plt.show()
The above shows the preliminary results of the Contextual Bandits model. The cumulative reward gradually increases over both training and testing, which suggests the model works. In the future, we will do more feature engineering to improve the performance.
!pip install --upgrade nbconvert
Successfully installed mistune-3.0.2 nbconvert-7.11.0
%%shell
jupyter nbconvert --to html /content/gdrive/MyDrive/data_science/mile_stone2/milestone2.ipynb
[NbConvertApp] Converting notebook /content/gdrive/MyDrive/data_science/mile_stone2/milestone2.ipynb to html [NbConvertApp] WARNING | Alternative text is missing on 11 image(s). [NbConvertApp] Writing 1777819 bytes to /content/gdrive/MyDrive/data_science/mile_stone2/milestone2.html